agent architecture
Emergent Communication
Recall that m̂_c(u) is exactly the listener's decoder in the IB framework (see Section 3.1.1). Therefore, any other decoder would yield an upper bound on the informativeness loss term. Notice that under our assumptions, m̂_c is a Gaussian mixture, whereas the speaker's beliefs are simply Gaussian. All the systems with the same k form an equivalence class, and the canonical system within each class is the one with minimal k. These canonical systems are the natural ones to prefer, because they can attain the optimum for a given complexity with a minimal codebook.
Benchmark for Planning and Control with Large Language Model Agents: Blocksworld with Model Context Protocol
Jobs, Niklas, da Silva, Luis Miguel Vieira, Somashekaraiah, Jayanth, Weigand, Maximilian, Kube, David, Gehlhoff, Felix
Industrial automation increasingly requires flexible control strategies that can adapt to changing tasks and environments. Agents based on Large Language Models (LLMs) offer potential for such adaptive planning and execution but lack standardized benchmarks for systematic comparison. We introduce a benchmark with an executable simulation environment representing the Blocksworld problem providing five complexity categories. By integrating the Model Context Protocol (MCP) as a standardized tool interface, diverse agent architectures can be connected to and evaluated against the benchmark without implementation-specific modifications. A single-agent implementation demonstrates the benchmark's applicability, establishing quantitative metrics for comparison of LLM-based planning and execution approaches.
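To make the Blocksworld task concrete, the following is a minimal sketch of a state representation, a move operation with precondition checks, and a goal check. It is not the benchmark's implementation and does not use the MCP interface; the names `move`, `is_clear`, and `run_plan`, and the list-of-stacks encoding, are assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

# Assumed encoding: a state is a list of stacks, each listed bottom-to-top.
State = List[List[str]]


def find_block(state: State, block: str) -> Tuple[int, int]:
    """Return (stack index, height) of a block, or raise ValueError."""
    for i, stack in enumerate(state):
        if block in stack:
            return i, stack.index(block)
    raise ValueError(f"unknown block: {block}")


def is_clear(state: State, block: str) -> bool:
    """A block is clear when nothing is stacked on top of it."""
    i, height = find_block(state, block)
    return height == len(state[i]) - 1


def move(state: State, block: str, target: Optional[str] = None) -> State:
    """Return a new state with `block` placed on `target` (None = table)."""
    if not is_clear(state, block):
        raise ValueError(f"{block} is not clear")
    if target is not None:
        if target == block:
            raise ValueError("cannot place a block on itself")
        if not is_clear(state, target):
            raise ValueError(f"{target} is not clear")
    new_state = [list(stack) for stack in state]  # copy, keep input intact
    src, _ = find_block(new_state, block)
    new_state[src].pop()
    if target is None:
        new_state.append([block])
    else:
        dst, _ = find_block(new_state, target)
        new_state[dst].append(block)
    return [stack for stack in new_state if stack]  # drop emptied stacks


def run_plan(state: State, plan: List[Tuple[str, Optional[str]]],
             goal: State) -> bool:
    """Apply each (block, target) step and compare the result to the goal."""
    for block, target in plan:
        state = move(state, block, target)
    # Stack order on the table does not matter, so compare sorted stacks.
    return sorted(map(tuple, state)) == sorted(map(tuple, goal))
```

For example, starting from `[["A", "B"], ["C"]]`, the plan `[("B", "C"), ("A", "B")]` reaches the goal `[["C", "B", "A"]]`, while trying to move `A` first raises an error because `B` sits on top of it.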
MegaChat: A Synthetic Persian Q&A Dataset for High-Quality Sales Chatbot Evaluation
Rahmani, Mahdi, Saffari, AmirHossein, Rahmani, Reyhane
Small and medium-sized enterprises (SMEs) in Iran increasingly leverage Telegram for sales, where real-time engagement is essential for conversion. However, developing AI-driven chatbots for this purpose requires large, high-quality question-and-answer (Q&A) datasets, which are typically expensive and resource-intensive to produce, especially for low-resource languages like Persian. In this paper, we introduce MegaChat, the first fully synthetic Persian Q&A dataset designed to evaluate intelligent sales chatbots in Telegram-based e-commerce. We propose a novel, automated multi-agent architecture that generates persona-aware Q&A pairs by collecting data from active Telegram shopping channels. The system employs specialized agents for question generation, validation, and refinement, ensuring the production of realistic and diverse conversational data. To evaluate answer generation, we compare three classic retrieval-augmented generation (RAG) models with our advanced agentic system, which features multi-query retrieval, reranking, and persona-aligned response synthesis. Using GPT-5.1 for evaluation across six quality dimensions, our results show that the agentic architecture outperformed traditional RAG models in 4 out of 5 diverse channels, demonstrating its ability to generate scalable, high-quality datasets without relying on expensive human annotation or complex fine-tuning. MegaChat provides SMEs with an efficient, cost-effective solution for building intelligent customer engagement systems in specialized commercial domains, enabling advancements in multilingual conversational AI for low-resource languages.
Towards Outcome-Oriented, Task-Agnostic Evaluation of AI Agents
AlShikh, Waseem, Ali, Muayad Sayed, Kennedy, Brian, Mozolevskyi, Dmytro
As AI agents proliferate across industries and applications, evaluating their performance based solely on infrastructural metrics such as latency, time-to-first-token, or token throughput is proving insufficient. These metrics fail to capture the quality of an agent's decisions, its operational autonomy, or its ultimate business value. This white paper proposes a novel, comprehensive framework of eleven outcome-based, task-agnostic performance metrics for AI agents that transcend domain boundaries. These metrics are designed to enable organizations to evaluate agents based on the quality of their decisions, their degree of autonomy, their adaptability to new challenges, and the tangible business value they deliver, regardless of the underlying model architecture or specific use case. We introduce metrics such as Goal Completion Rate (GCR), Autonomy Index (AIx), Multi-Step Task Resilience (MTR), and Business Impact Efficiency (BIE). Through a large-scale simulated experiment involving four distinct agent architectures (ReAct, Chain-of-Thought, Tool-Augmented, Hybrid) across five diverse domains (Healthcare, Finance, Marketing, Legal, and Customer Service), we demonstrate the framework's efficacy. Our results reveal significant performance trade-offs between different agent designs, highlighting the Hybrid Agent as the most consistently high-performing model across the majority of our proposed metrics, achieving an average Goal Completion Rate of 88.8% and the highest Return on Investment (ROI). This work provides a robust, standardized methodology for the holistic evaluation of AI agents, paving the way for more effective development, deployment, and governance.
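As an illustration of an outcome-based metric, the sketch below computes a Goal Completion Rate per agent. The abstract does not give the formula, so this assumes the simplest reading: the fraction of episodes in which the agent reached its goal. The `Episode` record and the function name are hypothetical, and the sample values in the usage note are made up for illustration rather than taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Episode:
    """One evaluated task attempt (hypothetical record layout)."""
    agent: str
    goal_completed: bool


def goal_completion_rate(episodes: Iterable[Episode]) -> Dict[str, float]:
    """Assumed definition: completed episodes / total episodes, per agent."""
    completed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for episode in episodes:
        total[episode.agent] += 1
        completed[episode.agent] += int(episode.goal_completed)
    if not total:
        raise ValueError("no episodes to evaluate")
    return {agent: completed[agent] / total[agent] for agent in total}
```

Given four episodes for a `"Hybrid"` agent of which three succeed, the function reports a rate of 0.75 for that agent. Unlike latency or throughput, this number depends only on task outcomes.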
AgentArcEval: An Architecture Evaluation Method for Foundation Model based Agents
Lu, Qinghua, Zhao, Dehai, Liu, Yue, Zhang, Hao, Zhu, Liming, Xu, Xiwei, Shi, Angela, Tan, Tristan, Kazman, Rick
The emergence of foundation models (FMs) has enabled the development of highly capable and autonomous agents, unlocking new application opportunities across a wide range of domains. Evaluating the architecture of agents is particularly important, as architectural decisions significantly impact the quality attributes of agents given their unique characteristics, including compound architecture, autonomous and non-deterministic behaviour, and continuous evolution. However, traditional architecture evaluation methods fall short in addressing the evaluation needs of agent architecture because of these characteristics. Therefore, in this paper, we present AgentArcEval, a novel agent architecture evaluation method designed specifically to address the complexities of FM-based agent architecture and its evaluation. Moreover, we present a catalogue of agent-specific general scenarios, which serves as a guide for generating concrete scenarios to design and evaluate the agent architecture. We demonstrate the usefulness of AgentArcEval and the catalogue through a case study on the architecture evaluation of a real-world tax copilot, named Luna.
A cybersecurity AI agent selection and decision support framework
This paper presents a novel, structured decision support framework that systematically aligns diverse artificial intelligence (AI) agent architectures (reactive, cognitive, hybrid, and learning) with the National Institute of Standards and Technology (NIST) Cybersecurity Framework (CSF) 2.0. By integrating agent theory with industry guidelines, this framework provides a transparent and stepwise methodology for selecting and deploying AI solutions to address contemporary cyber threats. Employing a granular decomposition of NIST CSF 2.0 functions into specific tasks, the study links essential AI agent properties such as autonomy, adaptive learning, and real-time responsiveness to each subcategory's security requirements. In addition, it outlines graduated levels of autonomy (assisted, augmented, and fully autonomous) to accommodate organisations at varying stages of cybersecurity maturity. This holistic approach transcends isolated AI applications, providing a unified detection, incident response, and governance strategy. Through conceptual validation, the framework demonstrates how tailored AI agent deployments can align with real-world constraints and risk profiles, enhancing situational awareness, accelerating response times, and fortifying long-term resilience via adaptive risk management. Ultimately, this research bridges the gap between theoretical AI constructs and operational cybersecurity demands, establishing a foundation for robust, empirically validated multi-agent systems that adhere to industry standards.